
    An algorithm to compute the power of Monte Carlo tests with guaranteed precision

    This article presents an algorithm that generates a conservative confidence interval of a specified length and coverage probability for the power of a Monte Carlo test (such as a bootstrap or permutation test). It is the first method that achieves this aim for almost any Monte Carlo test. Previous research has focused on obtaining as accurate a result as possible for a fixed computational effort, without providing a guaranteed precision in the above sense. The algorithm we propose does not have a fixed effort and runs until a confidence interval with a user-specified length and coverage probability can be constructed. We show that the expected effort required by the algorithm is finite in most cases of practical interest, including situations where the distribution of the p-value is absolutely continuous or discrete with finite support. The algorithm is implemented in the R-package simctest, available on CRAN. Comment: Published at http://dx.doi.org/10.1214/12-AOS1076 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
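
    To make the sequential idea concrete, here is a minimal Python sketch (our illustration, not the simctest algorithm): it repeatedly simulates data under an assumed alternative, runs a permutation test, and stops once a Clopper-Pearson interval for the rejection probability, i.e. the power, is shorter than a user-specified length. The data model, effect size and test are assumptions, and unlike the paper's method this sketch ignores the Monte Carlo error inside each individual test.

        import numpy as np
        from scipy.stats import beta

        def clopper_pearson(k, n, conf=0.95):
            # exact (conservative) binomial confidence interval
            lo = beta.ppf((1 - conf) / 2, k, n - k + 1) if k > 0 else 0.0
            hi = beta.ppf(1 - (1 - conf) / 2, k + 1, n - k) if k < n else 1.0
            return lo, hi

        def permutation_pvalue(x, y, n_perm=199, rng=None):
            # one-sided two-sample permutation test on the difference of means
            rng = rng or np.random.default_rng()
            pooled = np.concatenate([x, y])
            obs = x.mean() - y.mean()
            hits = 0
            for _ in range(n_perm):
                rng.shuffle(pooled)
                hits += pooled[:len(x)].mean() - pooled[len(x):].mean() >= obs
            return (hits + 1) / (n_perm + 1)

        def estimate_power(alpha=0.05, target_len=0.1, conf=0.95, effect=0.5):
            # no fixed effort: simulate datasets under the alternative until
            # the CI for the rejection probability is shorter than target_len
            rng = np.random.default_rng(0)
            rejections = n = 0
            while True:
                x = rng.normal(effect, 1.0, 30)   # alternative (assumption)
                y = rng.normal(0.0, 1.0, 30)
                rejections += permutation_pvalue(x, y, rng=rng) <= alpha
                n += 1
                lo, hi = clopper_pearson(rejections, n, conf)
                if n > 1 and hi - lo <= target_len:
                    return rejections / n, (lo, hi), n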

    Consistency of adjacency spectral embedding for the mixed membership stochastic blockmodel

    The mixed membership stochastic blockmodel is a statistical model for a graph, which extends the stochastic blockmodel by allowing every node to randomly choose a different community each time a decision of whether to form an edge is made. Whereas spectral analysis for the stochastic blockmodel is increasingly well established, theory for the mixed membership case is considerably less developed. Here we show that adjacency spectral embedding into $\mathbb{R}^k$, followed by fitting the minimum volume enclosing convex $k$-polytope to the $k-1$ principal components, leads to a consistent estimate of a $k$-community mixed membership stochastic blockmodel. The key is to identify a direct correspondence between the mixed membership stochastic blockmodel and the random dot product graph, which greatly facilitates theoretical analysis. Specifically, a $2 \to \infty$ norm bound and a central limit theorem for the random dot product graph are exploited to respectively show consistency and partially correct the bias of the procedure. Comment: 12 pages, 6 figures.
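
    The embedding step is easy to sketch in Python; fitting the minimum volume enclosing polytope is the harder part, so the greedy vertex-hunting routine below is only a crude stand-in for it (an assumption of this sketch, not the paper's procedure).

        import numpy as np

        def adjacency_spectral_embedding(A, k):
            # top-k eigenpairs of the symmetric adjacency matrix by magnitude,
            # eigenvectors scaled by sqrt(|eigenvalue|)
            vals, vecs = np.linalg.eigh(A)
            idx = np.argsort(np.abs(vals))[::-1][:k]
            return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

        def hunt_vertices(X, k):
            # greedy successive projection: a stand-in (not the paper's
            # min-volume polytope fit) for locating k extreme points of the
            # embedded cloud, i.e. the pure-community positions
            R = X.copy()
            corners = []
            for _ in range(k):
                j = int(np.argmax(np.linalg.norm(R, axis=1)))
                corners.append(j)
                u = R[j] / np.linalg.norm(R[j])
                R = R - np.outer(R @ u, u)   # project out the chosen direction
            return X[corners]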

    Posterior predictive p-values and the convex order

    Posterior predictive p-values are a common approach to Bayesian model-checking. This article analyses their frequency behaviour, that is, their distribution when the parameters and the data are drawn from the prior and the model respectively. We show that the family of possible distributions is exactly described as the distributions that are less variable than uniform on [0,1], in the convex order. In general, p-values with such a property are not conservative, and we illustrate how the theoretical worst-case error rate for false rejection can occur in practice. We describe how to correct the p-values to recover conservatism in several common scenarios, for example, when interpreting a single p-value or when combining multiple p-values into an overall score of significance. We also handle the case where the p-value is estimated from posterior samples obtained from techniques such as Markov chain Monte Carlo or sequential Monte Carlo. Our results place posterior predictive p-values in a much clearer theoretical framework, allowing them to be used with more assurance. Comment: 14 pages, 3 figures.
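
    A minimal sketch of the object under study, assuming a normal model with known variance and a flat prior (our assumptions, chosen so the posterior is available in closed form): the posterior predictive p-value is the probability, averaged over posterior draws of the parameter, that a replicated statistic exceeds the observed one. The article's correction procedures are not implemented here.

        import numpy as np

        def posterior_predictive_pvalue(y, theta_draws, statistic, rng=None):
            # p = P(T(y_rep) >= T(y_obs)), averaged over posterior draws of theta
            rng = rng or np.random.default_rng()
            t_obs = statistic(y)
            exceed = 0
            for theta in theta_draws:
                y_rep = rng.normal(theta, 1.0, size=len(y))  # N(theta, 1) model (assumption)
                exceed += statistic(y_rep) >= t_obs
            return exceed / len(theta_draws)

        rng = np.random.default_rng(1)
        y = rng.normal(0.3, 1.0, 50)
        # flat prior + known variance give theta | y ~ N(ybar, 1/n)
        theta_draws = rng.normal(y.mean(), 1 / np.sqrt(len(y)), size=2000)
        p = posterior_predictive_pvalue(y, theta_draws, statistic=np.max, rng=rng)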

    Manifold structure in graph embeddings

    Statistical analysis of a graph often starts with embedding, the process of representing its nodes as points in space. How to choose the embedding dimension is a nuanced decision in practice, but in theory a notion of true dimension is often available. In spectral embedding, this dimension may be very high. However, this paper shows that existing random graph models, including graphon and other latent position models, predict the data should live near a much lower-dimensional set. One may therefore circumvent the curse of dimensionality by employing methods which exploit hidden manifold structure.
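
    An illustrative construction (ours, not taken from the paper): a one-dimensional latent variable pushed through a rank-two kernel yields a graph whose two-dimensional spectral embedding concentrates around a curve, so the embedding dimension is 2 while the intrinsic dimension is 1.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 1000
        u = rng.uniform(0, 1, n)                         # 1-d latent positions
        # rank-2 kernel with feature map phi(u) = (u, u^2)/sqrt(2)
        P = (np.outer(u, u) + np.outer(u, u) ** 2) / 2
        A = np.triu((rng.uniform(size=(n, n)) < P).astype(float), 1)
        A = A + A.T                                      # symmetric adjacency matrix

        vals, vecs = np.linalg.eigh(A)
        idx = np.argsort(np.abs(vals))[::-1][:2]
        X = vecs[:, idx] * np.sqrt(np.abs(vals[idx]))
        # the rows of X lie close to a one-dimensional curve in R^2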

    Matrix factorisation and the interpretation of geodesic distance

    Given a graph or similarity matrix, we consider the problem of recovering a notion of true distance between the nodes, and so their true positions. We show that this can be accomplished in two steps: matrix factorisation, followed by nonlinear dimension reduction. This combination is effective because the point cloud obtained in the first step lives close to a manifold in which latent distance is encoded as geodesic distance. Hence, a nonlinear dimension reduction tool, approximating geodesic distance, can recover the latent positions, up to a simple transformation. We give a detailed account of the case where spectral embedding is used, followed by Isomap, and provide encouraging experimental evidence for other combinations of techniques.
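
    The two-step recipe can be sketched with off-the-shelf tools; the embedding dimension, latent dimension and neighbourhood size below are illustrative assumptions, not values from the paper.

        import numpy as np
        from sklearn.manifold import Isomap

        def spectral_then_isomap(A, embed_dim, latent_dim, n_neighbors=10):
            # step 1: matrix factorisation via adjacency spectral embedding
            vals, vecs = np.linalg.eigh(A)
            idx = np.argsort(np.abs(vals))[::-1][:embed_dim]
            X = vecs[:, idx] * np.sqrt(np.abs(vals[idx]))
            # step 2: Isomap approximates geodesic distance along the manifold
            # near which the embedded points concentrate
            return Isomap(n_neighbors=n_neighbors,
                          n_components=latent_dim).fit_transform(X)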

    Implications of sparsity and high triangle density for graph representation learning

    Recent work has shown that sparse graphs containing many triangles cannot be reproduced using a finite-dimensional representation of the nodes, in which link probabilities are inner products. Here, we show that such graphs can be reproduced using an infinite-dimensional inner product model, where the node representations lie on a low-dimensional manifold. Recovering a global representation of the manifold is impossible in a sparse regime. However, we can zoom in on local neighbourhoods, where a lower-dimensional representation is possible. As our constructions allow the points to be uniformly distributed on the manifold, we find evidence against the common perception that triangles imply community structure.
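
    A toy calculation (our illustration, not a construction from the paper): nodes placed on a circle and joined to nearby neighbours form a graph that is sparse overall yet rich in triangles, the regime the abstract describes, with no community structure in sight.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 1000
        theta = rng.uniform(0, 2 * np.pi, n)      # points on a circle (1-d manifold)
        d = np.abs(theta[:, None] - theta[None, :])
        d = np.minimum(d, 2 * np.pi - d)          # arc-length distance
        A = (d < 20 * np.pi / n).astype(float)    # connect close pairs: degree ~ 20
        np.fill_diagonal(A, 0)

        edges = A.sum() / 2
        triangles = np.trace(A @ A @ A) / 6       # each triangle counted 6 times
        print(f"edge density {edges / (n * (n - 1) / 2):.4f}, "
              f"triangles per edge {triangles / edges:.2f}")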

    Hierarchical clustering with dot products recovers hidden tree structure

    In this paper we offer a new perspective on the well-established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN.
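
    The merge rule is simple to state in code: the following naive sketch implements agglomerative clustering with average dot-product linkage. It is our own quadratic-per-step illustration, not the authors' implementation.

        import numpy as np

        def dot_product_agglomeration(X):
            # agglomerative clustering that, at each step, merges the pair of
            # clusters with the largest average pairwise dot product
            G = X @ X.T                            # Gram matrix of dot products
            clusters = [[i] for i in range(len(X))]
            merges = []
            while len(clusters) > 1:
                best, pair = -np.inf, None
                for a in range(len(clusters)):
                    for b in range(a + 1, len(clusters)):
                        score = G[np.ix_(clusters[a], clusters[b])].mean()
                        if score > best:
                            best, pair = score, (a, b)
                a, b = pair
                merges.append((list(clusters[a]), list(clusters[b]), best))
                clusters[a] += clusters[b]
                del clusters[b]
            return merges                          # the binary merge tree, root last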